Abstract

Information estimates such as the "direct method" of Strong et al. (1998) sidestep 
the difficult problem of estimating the joint distribution of response and stimulus by 
instead estimating the difference between the marginal and conditional entropies of the 
response. While this is an effective estimation strategy, it tempts the practitioner to 
ignore the role of the stimulus and the meaning of mutual information. We show here 
that, as the number of trials increases indefinitely, the direct (or "plug-in") estimate 
of marginal entropy converges (with probability 1) to the entropy of the time-averaged 
conditional distribution of the response, and the direct estimate of the conditional 
entropy converges to the time-averaged entropy of the conditional distribution of the 
response. Under joint stationarity and ergodicity of the response and stimulus, the 
difference of these quantities converges to the mutual information. When the stimulus 
is deterministic or non-stationary the direct estimate of information no longer esti- 
mates mutual information, which is no longer meaningful, but it remains a measure of 
variability of the response distribution across time. 



1 Introduction 



Information estimates are used to characterize the amount of information that a spike train 
contains about a stimulus H]. They are motivated by information theory [14] and widely
believed to estimate the mutual information (or mutual information rate) between stimulus 
and spike train response. They are frequently calculated using data from experiments where
the stimulus and response are dynamic and time-varying [HI [12] [13] E].

For mutual information to be properly defined (see, for example, [5]), the stimulus and
response must be considered random, and when the estimates are obtained from time-averages,
they should also be stationary and ergodic. In practice these assumptions are usually tacit,
and information estimates, such as the direct method proposed by [15], can be made without
explicit consideration of the stimulus. This can lead to misinterpretation.

The purpose of this note is to show that the direct method information estimate can be 
reinterpreted as the average divergence across time of the conditional response distribution 
from its overall mean; in the absence of stationarity and ergodicity: 

1. information estimates do not necessarily estimate mutual information, but 

2. potentially useful interpretations can still be made by referring back to the time-varying
divergence.

Although our results are specialized to the direct method with the plug-in entropy estimator, 
they should hold more generally regardless of the choice of entropy estimator. 

The fundamental issue concerns stationarity: methods that assume stationarity are
unlikely to be appropriate when stationarity appears to be violated. In the non-stationary case,
our second result should be of use, as would be other methods that explicitly consider the
dynamic and non-stationary nature of the stimulus and response; see for instance [2].

Footnote: See [16] for a recent review of existing entropy estimators.

We begin with a brief review of the direct method and plug-in entropy estimator. This
is followed by results showing that the information estimate can be recast as a time-average. 
This characterization leads us to the interpretation that the information estimate is actually 
a measure of variability of the stimulus conditioned response distribution. This observation 
is first made in the finite number of trials case, and then formalized by a theorem describing 
the limiting behavior of the information estimate as the number of trials tends to infinity. 
Following the theorem is discussion about the interpretation of the limit, and examples that 
illustrate the interpretation with a proposed graphical plot. 

2 Review of the direct method 

In the direct method a time-varying stimulus is chosen by the experimenter and then repeatedly
presented to a subject over multiple trials. The observed responses are conditioned by
the same stimulus. Two types of variation in the response are considered: 

1. variation across time (potentially related to the stimulus), and 

2. trial-to-trial variation. 

Figure 1(a) shows an example of data from such an experiment. The upper panel is a raster
plot of the response of a Field L neuron of an adult male Zebra Finch during synthetic song
stimulation. The lower panel is a plot of the audio signal corresponding to the synthetic stimulus.
Details of the experiment can be found in [8].

Let us consider the random process $\{S_t, R_t^k\}$ representing the value of the stimulus and
response at time $t = 1, \ldots, n$ during trial $k = 1, \ldots, m$. The response is made discrete
by dividing time into bins of size $dt$ and then considering words (or patterns) of spike
counts formed within intervals (overlapping or non-overlapping) of $L$ adjacent time bins.
The numbers of spikes that occur in the time bins become the letters in the words. $R_t^k$
corresponds to these words, and may belong to a countably infinite set (because the number
of spikes in a bin is theoretically unbounded). In the raster plot of Figure 1(a) the time bin
size is $dt = 1$ millisecond, and the vertical lines demarcate non-overlapping words of length
$L = 10$ time bins.
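As a minimal sketch of this discretization, the following Python function bins the spike times of a single trial and cuts the bins into non-overlapping words. The function name `spikes_to_words`, its parameters, and the toy spike times are illustrative assumptions, not part of the original analysis.

```python
from typing import List, Tuple

def spikes_to_words(spike_times_ms: List[float], duration_ms: int,
                    dt_ms: int = 1, word_len: int = 10) -> List[Tuple[int, ...]]:
    """Bin one trial's spike times and cut the bins into non-overlapping words."""
    n_bins = duration_ms // dt_ms
    counts = [0] * n_bins
    for s in spike_times_ms:
        b = int(s // dt_ms)
        if 0 <= b < n_bins:
            counts[b] += 1              # letter = spike count in this bin
    n_words = n_bins // word_len        # an incomplete trailing word is dropped
    return [tuple(counts[i * word_len:(i + 1) * word_len])
            for i in range(n_words)]

# Hypothetical trial: four spikes in a 20 ms window, dt = 1 ms, L = 10 bins.
words = spikes_to_words([0.5, 3.2, 3.7, 12.0], duration_ms=20)
print(words)
```

Each word is stored as a tuple so that it can be used directly as a key when tabulating empirical distributions. Overlapping words, which the text also allows, would need a sliding window instead of the fixed stride used here.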

Given the responses $\{R_t^k\}$, the direct method considers two different entropies:

1. the total entropy $H$ of the response, and

2. the local noise entropy $H_t$ of the response at time $t$.

The total entropy is associated with the stimulus-conditioned distribution of the response
across all times and trials. The local noise entropy is associated with the stimulus-conditioned
distribution of the response at time $t$ across all trials. These quantities are calculated directly
from the neural response, and the difference between the total entropy and the average (over
$t$) noise entropy is what [15] call "the information that the spike train provides about the
stimulus."

$H$ and $H_t$ depend implicitly on the length $L$ of the words. Normalizing by $L$ and considering
large $L$ leads to the total and local entropy rates, defined to be $\lim_{L \to \infty} H(L)/L$
and $\lim_{L \to \infty} H_t(L)/L$, respectively, when they exist. The direct method of [15] prescribed
an extrapolation for estimating these limits; however, they do not necessarily exist when the
stimulus and response process are non-stationary. When there is stationarity, estimation
of entropy for large $L$ is potentially difficult, and extrapolation from a few small choices of
$L$ can be suspect. Since we are primarily interested in the non-stationary case, we do not
address these issues and refer the reader to [9], [7] for further discussion of the stationary case.
For notational simplicity, the dependence on $L$ will be suppressed in the remainder of the
text.
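The plug-in entropy estimate. [15] proposed estimating $H$ and $H_t$ by plug-in with the
corresponding empirical distributions:

\[
\hat{P}(r) := \frac{1}{mn} \sum_{t=1}^{n} \sum_{k=1}^{m} 1_{\{R_t^k = r\}} \tag{1}
\]

and

\[
\hat{P}_t(r) := \frac{1}{m} \sum_{k=1}^{m} 1_{\{R_t^k = r\}}. \tag{2}
\]

Note that $\hat{P}$ is also the average of $\hat{P}_t$ across $t = 1, \ldots, n$. So the direct method plug-in
estimates of $H$ and $H_t$ are

\[
\hat{H} := -\sum_r \hat{P}(r) \log \hat{P}(r) \tag{3}
\]

and

\[
\hat{H}_t := -\sum_r \hat{P}_t(r) \log \hat{P}_t(r), \tag{4}
\]

respectively. The direct method plug-in information estimate is

\[
\hat{I} := \hat{H} - \frac{1}{n} \sum_{t=1}^{n} \hat{H}_t. \tag{5}
\]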
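The plug-in estimates (1)-(5) can be sketched in a few lines of Python. This is an illustrative sketch only: the layout `words[k][t]` (the word observed at time t on trial k), the function names, and the use of base-2 logarithms (bits) are assumptions made for the example.

```python
from collections import Counter
from math import log2

def entropy(dist):
    """Entropy in bits of a distribution given as {value: probability}."""
    return -sum(q * log2(q) for q in dist.values() if q > 0)

def empirical_distributions(words):
    """Return ([P_t for each t], P) from words[k][t], as in (2) and (1)."""
    m, n = len(words), len(words[0])
    p_t = []
    for t in range(n):
        counts = Counter(words[k][t] for k in range(m))
        p_t.append({r: c / m for r, c in counts.items()})
    p = Counter()
    for dist in p_t:                    # P is the time average of the P_t
        for r, q in dist.items():
            p[r] += q / n
    return p_t, dict(p)

def direct_information(words):
    """Plug-in information estimate (5): total minus mean noise entropy."""
    p_t, p = empirical_distributions(words)
    return entropy(p) - sum(entropy(d) for d in p_t) / len(p_t)

# Toy data: 2 trials, 2 time points, responses perfectly locked to time.
print(direct_information([["a", "b"], ["a", "b"]]))
```

In the toy data the response is deterministic at each time but differs across time, so the noise entropies vanish and the estimate equals the total entropy. If instead the response distribution is the same at every time, the estimate is zero.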



3 Results 

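The direct method information estimate is not only the difference of entropies shown in (5),
but also a time-average of divergences. The empirical distribution of response across all trials
and times (1) is equal to the average of $\hat{P}_t$ over time. That is, $\hat{P}(r) = \frac{1}{n} \sum_{t=1}^{n} \hat{P}_t(r)$, and so

\begin{align}
\hat{I} &= \hat{H} - \frac{1}{n} \sum_{t=1}^{n} \hat{H}_t \tag{6} \\
&= -\frac{1}{n} \sum_{t=1}^{n} \sum_r \hat{P}_t(r) \log \hat{P}(r) + \frac{1}{n} \sum_{t=1}^{n} \sum_r \hat{P}_t(r) \log \hat{P}_t(r) \tag{7} \\
&= \frac{1}{n} \sum_{t=1}^{n} \sum_r \hat{P}_t(r) \left[ \log \hat{P}_t(r) - \log \hat{P}(r) \right] \tag{8} \\
&= \frac{1}{n} \sum_{t=1}^{n} \sum_r \hat{P}_t(r) \log \frac{\hat{P}_t(r)}{\hat{P}(r)}. \tag{9}
\end{align}

The quantity that is averaged over time in (9) is the Kullback-Leibler divergence between
the empirical time $t$ response distribution $\hat{P}_t$ and the average empirical response distribution
$\hat{P}$.

Footnote: [15] used the name naive estimates for the plug-in estimates.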
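The identity between the entropy difference and the time-averaged divergence can be checked numerically. The sketch below is a self-contained illustration on made-up data; the helper names and the toy responses are assumptions, and logarithms are taken in base 2.

```python
from collections import Counter
from math import log2

def dists(words):
    """Empirical P_t for each t and their time average P, from words[k][t]."""
    m, n = len(words), len(words[0])
    p_t = [{r: c / m for r, c in Counter(w[t] for w in words).items()}
           for t in range(n)]
    p = Counter()
    for d in p_t:
        for r, q in d.items():
            p[r] += q / n
    return p_t, dict(p)

def entropy(d):
    return -sum(q * log2(q) for q in d.values() if q > 0)

def kl(d, ref):
    """Kullback-Leibler divergence of d from ref, in bits."""
    return sum(q * log2(q / ref[r]) for r, q in d.items() if q > 0)

# Hypothetical responses: 4 trials, 3 time points.
words = [["a", "b", "a"], ["a", "b", "b"], ["b", "b", "a"], ["a", "a", "a"]]
p_t, p = dists(words)
info_entropy = entropy(p) - sum(entropy(d) for d in p_t) / len(p_t)
divergences = [kl(d, p) for d in p_t]
info_kl = sum(divergences) / len(divergences)
print(info_entropy, info_kl)
```

The two printed numbers agree up to floating-point rounding, and each entry of `divergences` is non-negative, which is what makes the per-time terms plottable as a measure of departure from the average distribution.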

Since the same stimulus is repeatedly presented to the subject, and there is no evolution
in the response over multiple trials, the following repeated trial assumption is natural:

Conditional on the stimulus $\{S_t\}$, the $m$ trials $\{S_t, R_t^1\}, \ldots, \{S_t, R_t^m\}$ are
independent and identically distributed (i.i.d.).

Under this assumption $1_{\{R_t^1 = r\}}, \ldots, 1_{\{R_t^m = r\}}$ are conditionally i.i.d. for each fixed $t$ and $r$.
Furthermore, the law of large numbers guarantees that as the number of trials $m$ increases
the empirical response distribution $\hat{P}_t(r)$ converges to its conditional expected value for each
fixed $t$ and $r$. Thus $\hat{P}_t(r)$ and $\hat{P}(r)$ can be viewed as estimates of $P_t(r \mid S_1, \ldots, S_n)$, defined by

\[
P_t(r \mid S_1, \ldots, S_n) := P(R_t^1 = r \mid S_1, \ldots, S_n) = E\{\hat{P}_t(r) \mid S_1, \ldots, S_n\}, \tag{10}
\]

and $P(r \mid S_1, \ldots, S_n)$, defined by

\[
P(r \mid S_1, \ldots, S_n) := \frac{1}{n} \sum_{t=1}^{n} P_t(r \mid S_1, \ldots, S_n), \tag{11}
\]

respectively. $P$ is the average response distribution across time $t = 1, \ldots, n$ conditional on the
entire stimulus $\{S_1, \ldots, S_n\}$.

So the quantity that is averaged over time in (9) should be viewed as a plug-in estimate
of the Kullback-Leibler divergence between $P_t$ and $P$. We emphasize this by writing

\[
\hat{D}(P_t \| P) := \sum_r \hat{P}_t(r) \log \frac{\hat{P}_t(r)}{\hat{P}(r)}. \tag{12}
\]

This observation will be formalized by the theorem of the next section. For now we summarize
the above with a proposition.

Proposition 1. The information estimate is the time-average $\hat{I} = \frac{1}{n} \sum_{t=1}^{n} \hat{D}(P_t \| P)$.

This decomposition of the information estimate is analogous to the decomposition of mutual
information that [6] call the "specific surprise," while "specific information" is analogous
to the alternative decomposition,

\[
\hat{I} = \frac{1}{n} \sum_{t=1}^{n} [\hat{H} - \hat{H}_t]. \tag{13}
\]

An important difference is that here the stimulus itself is a function of time and the decompositions
are given in terms of time-dependent quantities. It is possible that these quantities
can reveal dynamic aspects of the stimulus and response relationship. This will be explored
further in Sections 3.2 and 3.3.
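The two decompositions share the same time-average but can differ at individual times. The following sketch, on a deliberately simple made-up data set, shows that the entropy-difference terms can be negative while the divergence terms cannot. The data layout `words[k][t]` and all names are assumptions for illustration.

```python
from collections import Counter
from math import log2

def dists(words):
    """Empirical P_t for each t and their time average P, from words[k][t]."""
    m, n = len(words), len(words[0])
    p_t = [{r: c / m for r, c in Counter(w[t] for w in words).items()}
           for t in range(n)]
    p = Counter()
    for d in p_t:
        for r, q in d.items():
            p[r] += q / n
    return p_t, dict(p)

def entropy(d):
    return -sum(q * log2(q) for q in d.values() if q > 0)

def kl(d, ref):
    return sum(q * log2(q / ref[r]) for r, q in d.items() if q > 0)

# Hypothetical data: the response is fixed at times 1-3 and split at time 4.
words = [["a", "a", "a", "a"],
         ["a", "a", "a", "b"]]
p_t, p = dists(words)
surprise = [kl(d, p) for d in p_t]                      # divergence terms
specific_info = [entropy(p) - entropy(d) for d in p_t]  # entropy-difference terms
print(surprise)
print(specific_info)
```

At the last time point the response is more variable than the time-averaged distribution, so the entropy-difference term is negative there, while the divergence term is positive. Which of the two is more useful to plot is a judgment call; the text that follows works with the divergence.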

3.1 What is being estimated? 

There are two directions in which the amount of observed response data can be increased:
length of time $n$, and number of trials $m$. The information estimate is the average of $\hat{D}(P_t \| P)$
over time, and may not necessarily converge as $n$ increases. This could be due to $\{S_t, R_t^k\}$
being non-stationary and/or highly dependent in time. Even when convergence may occur,
the presence of serial correlation in $\hat{D}(P_t \| P)$ (see the autocorrelation in panel (b) of Figure
2 for example) can make assessments of uncertainty in $\hat{I}$ difficult.

Assuming that the stimulus and response process is stationary and not too dependent 
in time could guarantee convergence, but this could be unrealistic. On the other hand, 
the repeated trial assumption is appropriate if the same stimulus is repeatedly presented 
to the subject over multiple trials. It is also enough to guarantee that the information 
estimate converges as the number of trials m increases. We prove the following theorem in 
the appendix. 

Theorem 1. Suppose that $P_t$ has finite entropy for all $t = 1, \ldots, n$. Then under the repeated
trial assumption

\[
\lim_{m \to \infty} \hat{I} = H(P) - \frac{1}{n} \sum_{t=1}^{n} H(P_t) = \frac{1}{n} \sum_{t=1}^{n} [H(P) - H(P_t)] = \frac{1}{n} \sum_{t=1}^{n} D(P_t \| P)
\]

with probability 1, and in particular the following statements hold uniformly for $t = 1, \ldots, n$
with probability 1:

1. $\lim_{m \to \infty} \hat{H} = H(P)$,

2. $\lim_{m \to \infty} \hat{H}_t = H(P_t)$, and

3. $\lim_{m \to \infty} \hat{D}(P_t \| P) = D(P_t \| P)$ for $t = 1, \ldots, n$,

where $D(P_t \| P)$ is the Kullback-Leibler divergence defined by

\[
D(P_t \| P) := \sum_r P_t(r \mid S_1, \ldots, S_n) \log \frac{P_t(r \mid S_1, \ldots, S_n)}{P(r \mid S_1, \ldots, S_n)},
\]

and $H(P)$ is the entropy of the distribution $P$, defined by

\[
H(P) := -\sum_r P(r) \log P(r).
\]



Note that if stationarity and ergodicity do hold, then $P_t$ for $t = 1, \ldots, n$ is also stationary
and ergodic. So its average, $P(r)$, is guaranteed by the ergodic theorem to converge
pointwise to $P(R_1^1 = r)$ as $n \to \infty$. Moreover, if $R_1^1$ can only take on a finite number of
values, then $H(P)$ also converges to the marginal entropy $H(R_1^1)$ of $R_1^1$. Likewise, the average
of the conditional entropy $H(P_t)$ also converges to the expected conditional entropy:
$\lim_{n \to \infty} H(R_1^1 \mid S_1, \ldots, S_n)$. So in this case the information estimate does indeed estimate
mutual information.

However, the primary consequence of the theorem is that, in the absence of stationarity
and ergodicity, the information estimate $\hat{I}$ does not necessarily estimate mutual information.
The three particular statements show that the time-varying quantities $[\hat{H} - \hat{H}_t]$ and $\hat{D}(P_t \| P)$
converge individually to the appropriate limits, and justify our assertion that the information
estimate is a time-average of plug-in estimates of the corresponding time-varying quantities.
Thus, the information estimate can always be viewed as an estimate of the time-average of
either $D(P_t \| P)$ or $[H(P) - H(P_t)]$, stationary and ergodic or not.
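The convergence in the number of trials can be illustrated with a small simulation. The sketch below assumes a binary response and a fixed (deterministic) sequence of conditional response probabilities; the probabilities, seed, and function names are made up for illustration and are not taken from the data analyzed in this paper.

```python
import random
from math import log2

def h2(p):
    """Entropy in bits of a binary response with P(response = 1) = p."""
    return 0.0 if p in (0.0, 1.0) else -(p * log2(p) + (1 - p) * log2(1 - p))

def limit_value(p_seq):
    """H(P) - (1/n) sum_t H(P_t) for binary responses with P_t(1) = p_seq[t]."""
    p_bar = sum(p_seq) / len(p_seq)
    return h2(p_bar) - sum(h2(p) for p in p_seq) / len(p_seq)

def plug_in_estimate(p_true, m, rng):
    """Simulate m conditionally i.i.d. trials and return the plug-in estimate."""
    p_hat = [sum(rng.random() < p for _ in range(m)) / m for p in p_true]
    # The plug-in estimate is the same functional applied to empirical frequencies.
    return limit_value(p_hat)

p_true = [0.1, 0.5, 0.9]        # hypothetical P_t(1) at n = 3 time points
rng = random.Random(0)
for m in (10, 100, 1000, 20000):
    print(m, plug_in_estimate(p_true, m, rng), limit_value(p_true))
```

No stationarity is assumed anywhere in this simulation: the sequence `p_true` is simply fixed, and the estimate approaches the time-averaged divergence as `m` grows. With few trials the estimate can be noticeably off, which is the usual small-sample behavior of plug-in entropy estimates.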

3.2 The information estimate measures variability of the response 
distribution 

The Kullback-Leibler divergence $D(P_t \| P)$ has a simple interpretation: it measures the
dissimilarity of the time $t$ response distribution $P_t$ from its overall average $P$. So as a
function of time, $D(P_t \| P)$ measures how the conditional response distribution varies across
time, relative to its overall mean. This can be seen in a more familiar form by considering
the leading term of the Taylor expansion,

\[
D(P_t \| P) \approx \frac{1}{2} \sum_r \frac{[P_t(r \mid S_1, \ldots, S_n) - P(r \mid S_1, \ldots, S_n)]^2}{P(r \mid S_1, \ldots, S_n)}. \tag{14}
\]

Thus, its average is in this sense a measure of the average variability of the response distribution.

Footnote: $P_t$ and $P$ are stimulus-conditional distributions, and hence random variables potentially depending on
$S_1, \ldots, S_n$.
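A quick numerical check of the quadratic approximation is sketched below. The two distributions are made-up values chosen to be close to each other, and natural logarithms are used so that the leading Taylor term carries the factor 1/2; in bits both sides would be divided by log 2.

```python
from math import log

def kl_nats(p, q):
    """Kullback-Leibler divergence of p from q, in nats."""
    return sum(a * log(a / b) for a, b in zip(p, q) if a > 0)

def half_chi_square(p, q):
    """Leading Taylor term: half the squared deviation scaled by q."""
    return 0.5 * sum((a - b) ** 2 / b for a, b in zip(p, q))

p_bar = [0.50, 0.30, 0.20]      # hypothetical time-averaged distribution
p_t = [0.52, 0.29, 0.19]        # hypothetical distribution at one time t
d = kl_nats(p_t, p_bar)
c = half_chi_square(p_t, p_bar)
print(d, c)
```

The two values are close when the time $t$ distribution is a small perturbation of the average. For large departures, such as the isolated bursts discussed later, the approximation is only qualitative.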

It is, of course, possible that characteristics of the response are due to confounding 
factors rather than the stimulus. Furthermore, the presence of additional noise in either 
process would weaken a measured relationship between stimulus and response, compared 
to its strength if the noise were eliminated. Setting these concerns aside, the variation of 
the response distribution Pt about its average provides information about the relationship 
between the stimulus and the response. In the stationary and ergodic case, this information 
may be averaged across time to obtain mutual information. In more general settings averaging
across time may not provide a complete picture of the relationship between stimulus and
response. Instead, we suggest examining the time-varying $D(P_t \| P)$ directly, via graphical
display as discussed next.

3.3 Plotting the divergence 

The plug-in estimate $\hat{D}(P_t \| P)$ is an obvious choice for estimating $D(P_t \| P)$, but it turns out
that estimating $D(P_t \| P)$ is akin to estimating entropy. Since the trials are conditionally
i.i.d., the coverage adjustment method described in [17] can be used to improve estimation
of $D(P_t \| P)$ over the plug-in estimate. The appendix contains the details of this.

Figures 1 and 2 show the responses of the same Field L neuron of an adult male Zebra
Finch under two different stimulus conditions. Details of the experiment and the statistics
of the stimuli are described in [8]. Panel (a) of the figures shows the stimulus and response
data. In Figure 1 the stimulus is synthetic and stationary by construction, while in Figure 2
the stimulus is a natural song. Panel (b) of the figures shows the coverage adjusted estimate
of the divergence $D(P_t \| P)$ plotted as a function of time. The 95% confidence intervals were
formed by bootstrapping entire trials, i.e. an entire trial is either included in or excluded
from a bootstrap sample.
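The trial-level bootstrap can be sketched as follows. For brevity this sketch resamples trials and recomputes the plug-in divergence rather than the coverage adjusted estimate, and it forms percentile intervals, which is an assumption since the type of interval is not specified in the text. The data layout `words[k][t]`, the function names, and the toy data are also illustrative.

```python
import random
from collections import Counter
from math import log2

def divergence_curve(words):
    """Plug-in divergence of each time-t distribution from the time average (bits)."""
    m, n = len(words), len(words[0])
    p_t = [{r: c / m for r, c in Counter(w[t] for w in words).items()}
           for t in range(n)]
    p_bar = Counter()
    for dist in p_t:
        for r, q in dist.items():
            p_bar[r] += q / n
    return [sum(q * log2(q / p_bar[r]) for r, q in dist.items()) for dist in p_t]

def bootstrap_bands(words, n_boot=200, level=0.95, seed=0):
    """Pointwise percentile intervals from resampling whole trials with replacement."""
    rng = random.Random(seed)
    m, n = len(words), len(words[0])
    curves = []
    for _ in range(n_boot):
        # A trial enters the replicate as a unit, keeping its time structure intact.
        sample = [words[rng.randrange(m)] for _ in range(m)]
        curves.append(divergence_curve(sample))
    lo_idx = int((1 - level) / 2 * (n_boot - 1))
    hi_idx = int((1 + level) / 2 * (n_boot - 1))
    bands = []
    for t in range(n):
        vals = sorted(curve[t] for curve in curves)
        bands.append((vals[lo_idx], vals[hi_idx]))
    return bands

# Hypothetical responses: 5 trials, 3 time points.
toy = [["a", "b", "a"], ["a", "b", "b"], ["b", "b", "a"],
       ["a", "a", "a"], ["a", "b", "a"]]
print(bootstrap_bands(toy, n_boot=100, seed=1))
```

Resampling whole trials respects the repeated trial assumption, under which trials are conditionally i.i.d. but time points within a trial may be strongly dependent. Resampling individual time bins would destroy that dependence.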
The information estimate going along with each divergence plot is the average of the
solid curve representing the estimate of $D(P_t \| P)$. It is equal to 0.77 bits (per 10 millisecond
word) in Figure 1(b) and 0.76 bits (per 10 millisecond word) in Figure 2(b). Although the
information estimates are nearly identical, the two plots are very different.

In the first case, the stimulus is stationary by construction and it appears that the time-varying
divergence is too. Its fluctuations appear to be roughly of the same scale across
time, and its local mean is relatively stable. The average of the solid curve seems to be a
fair summary.

In the second case the stimulus is a natural song. The isolated bursts of the time-varying
divergence and relatively flat regions in Figure 2(b) suggest that the response process (and
the divergence) is non-stationary and has strong serial correlations. The local mean of the
divergence also varies strongly with time. Summarizing $D(P_t \| P)$ by its time-average hides
the time-dependent features of the plot.

More interestingly, when the divergence plot is compared to the plot of the stimulus in 
Figure 2, there is a striking coincidence between the location of large isolated values of the 
estimated divergence and visual features of the stimulus waveform. They tend to coincide 
with the boundaries of the bursts in the stimulus signal. This suggests that the spike train 
may carry information about the onset/offset of bursts in the stimulus. We discussed this 
with the Theunissen Lab and they confirmed from their STRF models that the cell in the 
example is an offset cell. It tends to fire at the offsets of song syllables, the bursts of energy
in the stimulus waveform. They also suggested that a word length within the range of 
30-50 milliseconds is a better match to the length of correlations in the auditory system. 
We regenerated the plots for words of length L = 40 (not shown here) and found that the 
isolated structures in the divergence plot became even more pronounced. 
4 Discussion



Estimates of mutual information, including the plug-in estimate, may be viewed as measures 
of the strength of the relationship between the response and the stimulus when the stimulus 
and response are jointly stationary and ergodic. Many applications, however, use non- 
stationary or even deterministic stimuli, so that mutual information is no longer well defined. 
In such non-stationary cases do estimates of mutual information become meaningless? We 
think not, but the purpose of this note has been to point out the delicacy of the situation, 
and to suggest a viable interpretation of information estimates, along with the divergence 
plot, in the non-stationary case. 

In using stochastic processes to analyze data there is an implicit practical acknowledg- 
ment that assumptions cannot be met precisely: the mathematical formalism is, after all, an 
abstraction imposed on the data; the hope is simply that the variability displayed by the data 
is similar in relevant respects to that displayed by the presumptive stochastic process. The 
"relevant respects" involve the statistical properties deduced from the stochastic assump- 
tions. The point we are trying to make is that highly non-stationary stimuli make statistical 
properties based on an assumption of stationarity highly suspect; strictly speaking, they 
become void. 

To be more concrete, let us reconsider the snippet of natural song and response displayed 
in Figure 2. When we look at the less than 2 seconds of stimulus amplitude given there, 
the stimulus is not at all time-invariant: instead, the stimulus has a series of well-defined 
bursts followed by periods of quiescence. Perhaps, on a very much longer time scale, the 
stimulus would look stationary. But a good stochastic model on a long time scale would likely 
require long-range dependence. Indeed, it can be difficult to distinguish non-stationarity from
long-range dependence plj], and the usual statistical properties of estimators are known to
break down when long-range dependence is present [3]. Given a short interval of data, valid
statistical inference under stationarity assumptions becomes highly problematic. To avoid
these problems we have proposed the use of the divergence plot, and a recognition that the 
"bits per second" summary is no longer mutual information in the usual sense. Instead we 
would say that the estimate of information measures magnitude of variation of the response 
as the stimulus varies, and that this is a useful assessment of the extent to which the stimulus 
affects the response as long as other factors that affect the response are themselves time- 
invariant. In other deterministic or non-stationary settings the argument for the relevance 
of an information estimate should be analogous. Under stationarity and ergodicity, and 
indefinitely many trials, the stimulus sets that affect the response — whatever they are — 
will be repeatedly sampled, with appropriate probability, to determine the variability in the 
response distribution, with time-invariance in the response being guaranteed by the joint 
stationarity condition. This becomes part of the intuition behind mutual information. In 
the deterministic or non-stationary settings information estimates do not estimate mutual 
information, but they may remain intuitive assessments of strength of effect. 
A Appendix

A.1 Coverage adjusted estimate of $D(P_t \| P)$

The main idea behind coverage adjustment is to adjust estimates for potentially unobserved
values. This happens in two places: estimation of $P_t$ and estimation of $D(P_t \| P)$. In the
first case, unobserved values affect the amount of weight that $\hat{P}_t$, defined in (2) in the main
text, places on observed values. In the second case unobserved values correspond to missing
summands when plugging $\hat{P}_t$ into the Kullback-Leibler divergence. [17] gives a more thorough
explanation of these ideas. Let

\[
N_t(r) := \sum_{k=1}^{m} 1_{\{R_t^k = r\}}. \tag{15}
\]

The sample coverage, or total $P_t$ probability of observed values $r$, is estimated by $\hat{C}_t$ defined
by

\[
\hat{C}_t := 1 - \frac{\#\{r : N_t(r) = 1\} + 0.5}{m + 1}. \tag{16}
\]

The number in the numerator of the fraction refers to the number of singletons: patterns
that were observed only once across the $m$ trials at time $t$. Then the coverage adjusted
estimate of $P_t$ is the following shrunken version of $\hat{P}_t$:

\[
\tilde{P}_t(r) = \hat{C}_t \hat{P}_t(r). \tag{17}
\]

$P$ is estimated by simply averaging $\tilde{P}_t$:

\[
\tilde{P}(r) = \frac{1}{n} \sum_{t=1}^{n} \tilde{P}_t(r). \tag{18}
\]

The coverage adjusted estimate of $D(P_t \| P)$ is obtained by plugging $\tilde{P}_t$ and $\tilde{P}$ into the
Kullback-Leibler divergence, but with an additional weighting on the summands according
to the inverse of the estimated probability that the summand is observed:

\[
\tilde{D}(P_t \| P) := \sum_r \frac{\tilde{P}_t(r) \{\log \tilde{P}_t(r) - \log \tilde{P}(r)\}}{1 - (1 - \tilde{P}_t(r))^m}. \tag{19}
\]

The additional weighting is to correct for potentially missing summands. (This is also
explained in detail in [17].) Confidence intervals for $D(P_t \| P)$ can be obtained by bootstrap
sampling entire trials, and applying $\tilde{D}$ to the bootstrap replicate data.
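The steps of the coverage adjustment can be sketched in Python as follows. This is an illustrative reading of the formulas in this section, not a reference implementation: the data layout `words[k][t]`, the function name, the base-2 logarithm, and in particular the constant 0.5 added to the singleton count are assumptions that should be checked against [17].

```python
from collections import Counter
from math import log2

def coverage_adjusted_divergence(words):
    """Coverage adjusted divergence estimate at each time t (bits)."""
    m, n = len(words), len(words[0])
    p_tilde = []
    for t in range(n):
        counts = Counter(w[t] for w in words)           # N_t(r)
        singletons = sum(1 for c in counts.values() if c == 1)
        coverage = 1 - (singletons + 0.5) / (m + 1)     # assumed constant 0.5
        # Shrink the empirical distribution by the estimated coverage.
        p_tilde.append({r: coverage * c / m for r, c in counts.items()})
    p_bar = Counter()
    for dist in p_tilde:                                # average over time
        for r, q in dist.items():
            p_bar[r] += q / n
    out = []
    for dist in p_tilde:
        # Each summand is weighted by the inverse of the estimated
        # probability that the value r is observed in m trials.
        out.append(sum(q * (log2(q) - log2(p_bar[r])) / (1 - (1 - q) ** m)
                       for r, q in dist.items()))
    return out

# Hypothetical responses: 4 trials, 2 time points.
print(coverage_adjusted_divergence(
    [["a", "b"], ["a", "b"], ["a", "c"], ["b", "b"]]))
```

Because the shrunken distributions no longer sum to one, an individual adjusted value is not guaranteed to be non-negative, unlike the plug-in divergence. The estimated coverage stays strictly positive for any number of trials, so the logarithms and the weights are always well defined for observed values.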



A.2 Proofs

We will use the following extension of the Lebesgue Dominated Convergence Theorem in the
proof of Theorem 1.

Lemma 1. Let $f_m$ and $g_m$ for $m = 1, 2, \ldots$ be sequences of measurable, integrable functions
defined on a measure space equipped with measure $\mu$, and with pointwise limits $f$ and $g$,
respectively. Suppose further that $|f_m| \le g_m$ and $\lim_{m \to \infty} \int g_m \, d\mu = \int g \, d\mu < \infty$. Then

\[
\lim_{m \to \infty} \int f_m \, d\mu = \int \lim_{m \to \infty} f_m \, d\mu.
\]

Proof. By linearity of the integral,

\[
\liminf_{m \to \infty} \int (g + g_m) \, d\mu - \limsup_{m \to \infty} \int |f - f_m| \, d\mu = \liminf_{m \to \infty} \int \left[ (g + g_m) - |f - f_m| \right] d\mu.
\]

Since $0 \le (g + g_m) - |f - f_m|$, Fatou's Lemma implies

\[
\liminf_{m \to \infty} \int \left[ (g + g_m) - |f - f_m| \right] d\mu \ge \int \liminf_{m \to \infty} \left[ (g + g_m) - |f - f_m| \right] d\mu.
\]

The limit inferior on the inside of the right-hand integral is equal to $2g$ by assumption.
Combining with the previous two displays and the assumption that $\int g_m \, d\mu \to \int g \, d\mu$ gives

\[
\limsup_{m \to \infty} \left| \int f \, d\mu - \int f_m \, d\mu \right| \le \limsup_{m \to \infty} \int |f - f_m| \, d\mu \le 0.
\]

□

Proof of Theorem 1. The main statement of the theorem is implied by the three numbered
statements together with Proposition 1. We start with the second numbered statement.
Under the repeated trial assumption, $R_t^1, \ldots, R_t^m$ are conditionally i.i.d. given the stimulus
$\{S_t\}$. So Corollary 1 of [1] can be applied to show that

\begin{align}
\lim_{m \to \infty} \hat{H}_t &= \lim_{m \to \infty} -\sum_r \hat{P}_t(r) \log \hat{P}_t(r) \tag{20} \\
&= -\sum_r P_t(r \mid S_1, \ldots, S_n) \log P_t(r \mid S_1, \ldots, S_n) \tag{21} \\
&= H(P_t) \tag{22}
\end{align}

with probability 1. This proves the second numbered statement.

We will use Lemma 1 to prove the first numbered statement. For each $r$ the law of large
numbers asserts $\lim_{m \to \infty} \hat{P}_t(r) = P_t(r \mid S_1, \ldots, S_n)$ with probability 1. So for each $r$,

\[
\lim_{m \to \infty} -\hat{P}_t(r) \log \hat{P}(r) = -P_t(r \mid S_1, \ldots, S_n) \log P(r \mid S_1, \ldots, S_n) \tag{23}
\]

and

\[
\lim_{m \to \infty} -\hat{P}_t(r) \log \hat{P}_t(r) = -P_t(r \mid S_1, \ldots, S_n) \log P_t(r \mid S_1, \ldots, S_n) \tag{24}
\]

with probability 1. Fix a realization where (20)-(24) hold and let

\[
f_m(r) := -\hat{P}_t(r) \log \hat{P}(r)
\]

and

\[
g_m(r) := -\hat{P}_t(r) [\log \hat{P}_t(r) - \log n].
\]

Then for each $r$

\[
\lim_{m \to \infty} f_m(r) = -P_t(r \mid S_1, \ldots, S_n) \log P(r \mid S_1, \ldots, S_n) =: f(r)
\]

and

\[
\lim_{m \to \infty} g_m(r) = -P_t(r \mid S_1, \ldots, S_n) [\log P_t(r \mid S_1, \ldots, S_n) - \log n] =: g(r).
\]

The sequence $f_m$ is dominated by $g_m$ because

\begin{align}
0 &\le -\hat{P}_t(r) \log \hat{P}(r) = f_m(r) \tag{25} \\
&= -\hat{P}_t(r) \Big[ \log \sum_{u=1}^{n} \hat{P}_u(r) - \log n \Big] \tag{26} \\
&\le -\hat{P}_t(r) [\log \hat{P}_t(r) - \log n] \tag{27} \\
&= g_m(r) \tag{28}
\end{align}

for all $r$, where (27) uses the fact that $\log x$ is an increasing function. From (20) we also have
that $\lim_{m \to \infty} \sum_r g_m(r) = \sum_r g(r)$. Clearly, $f_m$ and $g_m$ are summable. Moreover $H(P_t) < \infty$
by assumption. So

\[
\sum_r g(r) = -\sum_r P_t(r) \log P_t(r) + \log n \sum_r P_t(r) = H(P_t) + \log n < \infty \tag{29}
\]

and the conditions of Lemma 1 are satisfied. Thus

\[
\lim_{m \to \infty} \sum_r -\hat{P}_t(r) \log \hat{P}(r) = \lim_{m \to \infty} \sum_r f_m(r) = \sum_r f(r) = \sum_r -P_t(r) \log P(r). \tag{30}
\]

Averaging over $t = 1, \ldots, n$ gives

\[
\lim_{m \to \infty} \hat{H} = \lim_{m \to \infty} \sum_r -\hat{P}(r) \log \hat{P}(r) = \sum_r -P(r) \log P(r) = H(P) \tag{31}
\]

for realizations where (20)-(24) hold. This proves the first numbered statement because the
probability of all such realizations is 1.

For the third numbered statement we begin with the expansions

\[
\hat{D}(P_t \| P) = \sum_r \left[ \hat{P}_t(r) \log \hat{P}_t(r) - \hat{P}_t(r) \log \hat{P}(r) \right] \tag{32}
\]

and

\[
D(P_t \| P) = \sum_r \left[ P_t(r) \log P_t(r) - P_t(r) \log P(r) \right]. \tag{33}
\]

The second numbered statement and (30) imply

\[
\lim_{m \to \infty} \sum_r \left[ \hat{P}_t(r) \log \hat{P}_t(r) - \hat{P}_t(r) \log \hat{P}(r) \right] = \sum_r P_t(r) \log P_t(r) - \sum_r P_t(r) \log P(r) \tag{34}
\]

with probability 1. This proves the third numbered statement.

□
Figure 1: (a) Raster plot of the response of a Field L neuron of an adult male Zebra
Finch (above) during the presentation of a synthetic audio stimulus (below) for 10 repeated
trials. The vertical lines indicate boundaries of $L = 10$ millisecond (msec) words formed at
a resolution of $dt = 1$ msec. The data consists of 10 trials, each of duration 2000 msecs. (b)
The coverage adjusted estimate (solid line) of $D(P_t \| P)$ from the response shown above with
10 msec words. Pointwise 95% confidence intervals are indicated by the shaded region and
obtained by bootstrapping the trials 1000 times. The information estimate, 0.77 bits (per
10 msec word, or 0.077 bits/msec), corresponds to the average value of the solid curve.

Figure 2: (a) Same as in Figure 1, but in this set of trials the stimulus is a conspecific
natural song. (b) The coverage adjusted estimate (solid line) of $D(P_t \| P)$ from the response
shown above. Pointwise 95% confidence intervals are indicated by the shaded region and
obtained by bootstrapping the trials 1000 times. The information estimate, 0.76 bits (per
10 msec word, or 0.076 bits/msec), corresponds to the average value of the solid curve.






